
    A genetic algorithm for interpretable model extraction from decision tree ensembles

    Models obtained by decision tree induction techniques excel in being interpretable. However, they can be prone to overfitting, which results in low predictive performance. Ensemble techniques provide a solution to this problem and are hence able to achieve higher accuracies. However, this comes at the cost of losing the excellent interpretability of the resulting model, making ensemble techniques impractical in applications where decision support, instead of decision making, is crucial. To bridge this gap, we present genesim, an algorithm that uses a genetic algorithm to transform an ensemble of decision trees into a single decision tree with enhanced predictive performance while maintaining interpretability. We compared genesim to prevalent decision tree induction algorithms, ensemble techniques and a similar technique, called ism, using twelve publicly available data sets. The results show that genesim achieves better predictive performance on most of these data sets than decision tree induction techniques and ism. The results also show that genesim's predictive performance is of the same order of magnitude as that of the ensemble techniques. However, the model produced by genesim outperforms the ensemble techniques in terms of interpretability, as it has a very low complexity.
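    A minimal sketch of the general idea behind such model extraction, assuming scikit-learn and a simple distillation-based fitness; the genome, the genetic operators and the complexity penalty below are illustrative assumptions, not the genesim procedure itself:

    ```python
    # Sketch: a genetic algorithm that searches for a single decision tree distilled from a
    # random-forest ensemble. Fitness = validation accuracy minus a small complexity penalty
    # (number of leaves). All names and weights are illustrative assumptions.
    import random
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    ensemble = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

    def make_individual():
        # Genome: which ensemble members to imitate, plus the depth of the extracted tree.
        mask = [random.random() < 0.5 for _ in ensemble.estimators_]
        return (mask, random.randint(2, 8))

    def fitness(ind):
        mask, depth = ind
        members = [t for t, keep in zip(ensemble.estimators_, mask) if keep] or ensemble.estimators_[:1]
        # Distill: relabel the training set with the majority vote of the selected members.
        votes = np.mean([m.predict(X_tr) for m in members], axis=0)
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, (votes >= 0.5).astype(int))
        return tree.score(X_val, y_val) - 0.001 * tree.get_n_leaves(), tree

    def crossover(a, b):
        cut = random.randrange(len(a[0]))
        return (a[0][:cut] + b[0][cut:], random.choice([a[1], b[1]]))

    def mutate(ind):
        mask, depth = ind
        mask = [bit ^ (random.random() < 0.05) for bit in mask]  # flip a few membership bits
        if random.random() < 0.2:
            depth = max(2, min(8, depth + random.choice([-1, 1])))
        return (mask, depth)

    population = [make_individual() for _ in range(20)]
    for generation in range(15):
        scored = sorted(population, key=lambda ind: fitness(ind)[0], reverse=True)
        parents = scored[:10]  # truncation selection
        population = parents + [mutate(crossover(*random.sample(parents, 2))) for _ in range(10)]

    score, best_tree = fitness(max(population, key=lambda ind: fitness(ind)[0]))
    print(f"fitness={score:.3f}, leaves={best_tree.get_n_leaves()}")
    ```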

    An extensive experimental evaluation of automated machine learning methods for recommending classification algorithms

    This paper presents an experimental comparison of four automated machine learning (AutoML) methods for recommending the best classification algorithm for a given input dataset. Three of these methods are based on evolutionary algorithms (EAs), and the other is Auto-WEKA, a well-known AutoML method based on the combined algorithm selection and hyper-parameter optimisation (CASH) approach. The EA-based methods build classification algorithms from a single machine learning paradigm: either decision-tree induction, rule induction, or Bayesian network classification. Auto-WEKA combines algorithm selection and hyper-parameter optimisation to recommend classification algorithms from multiple paradigms. We performed controlled experiments in which all four AutoML methods were given the same runtime limit, for different values of this limit. In general, the differences in predictive accuracy among the three best AutoML methods were not statistically significant. However, the EA that evolves decision-tree induction algorithms has the advantage of producing algorithms that generate interpretable classification models and that are more scalable to large datasets than many of the algorithms from other learning paradigms that Auto-WEKA can recommend. We also observed that Auto-WEKA exhibited meta-overfitting, a form of overfitting at the meta-learning level rather than at the base-learning level.
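    A minimal sketch of such a controlled comparison protocol, with scikit-learn classifiers standing in for the actual AutoML methods; the shared time limit is left as an unused parameter, and the choice of a Friedman test for significance is an illustrative assumption:

    ```python
    # Sketch: every method is evaluated on every dataset under the same runtime budget, and
    # the per-dataset accuracies are compared with a Friedman test. The stand-in "methods"
    # below are plain scikit-learn classifiers, not Auto-WEKA or the EA-based systems.
    import numpy as np
    from scipy.stats import friedmanchisquare
    from sklearn.datasets import load_iris, load_wine, load_breast_cancer, load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    METHODS = {  # placeholders; a real run would pass the shared time limit through
        "tree_method": lambda: DecisionTreeClassifier(random_state=0),
        "bayes_method": lambda: GaussianNB(),
        "cash_method": lambda: LogisticRegression(max_iter=2000),
    }

    def evaluate(loaders, methods, time_limit=3600):
        accuracies = {name: [] for name in methods}
        for loader in loaders:
            X, y = loader(return_X_y=True)
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
            for name, make in methods.items():
                model = make().fit(X_tr, y_tr)  # real AutoML methods would obey time_limit
                accuracies[name].append(model.score(X_te, y_te))
        stat, p = friedmanchisquare(*accuracies.values())
        return accuracies, p

    accs, p = evaluate([load_iris, load_wine, load_breast_cancer, load_digits], METHODS)
    print({k: np.round(v, 3).tolist() for k, v in accs.items()}, "Friedman p =", round(p, 3))
    ```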

    Measures and models for causal inference in cross-sectional studies: arguments for the appropriateness of the prevalence odds ratio and related logistic regression

    Background: Several papers have discussed which effect measures are appropriate to capture the contrast between exposure groups in cross-sectional studies, and which related multivariate models are suitable. Although some have favored the Prevalence Ratio over the Prevalence Odds Ratio, thus suggesting the use of log-binomial or robust Poisson models instead of logistic regression, this debate is still far from settled and requires close scrutiny. Discussion: In order to evaluate how accurately true causal parameters such as the Incidence Density Ratio (IDR) or the Cumulative Incidence Ratio (CIR) are estimated, this paper presents a series of scenarios in which a researcher happens to find a preset ratio of prevalences in a given cross-sectional study. Results show that, provided essential and non-waivable conditions for causal inference are met, the CIR is most often inestimable, whether through the Prevalence Ratio or the Prevalence Odds Ratio, and that the latter is the measure that consistently yields an appropriate estimate of the Incidence Density Ratio. Summary: Multivariate regression models should be avoided when the assumptions for causal inference from cross-sectional data do not hold. Nevertheless, if these assumptions are met, the logistic regression model is best suited for this task, as it provides a suitable estimate of the Incidence Density Ratio.
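    The link between the Prevalence Odds Ratio and the Incidence Density Ratio rests on the standard steady-state identity, written out below with notation assumed here (not taken from the paper), under stationarity and equal mean disease duration across exposure groups:

    ```latex
    % In a stationary population, prevalence odds = incidence density times mean disease duration:
    \[
      \frac{P}{1-P} = \mathrm{ID} \times \bar{D}
    \]
    % Taking the ratio of exposed (1) to unexposed (0) groups gives the prevalence odds ratio:
    \[
      \mathrm{POR}
        = \frac{P_1/(1-P_1)}{P_0/(1-P_0)}
        = \frac{\mathrm{ID}_1 \, \bar{D}_1}{\mathrm{ID}_0 \, \bar{D}_0}
        = \mathrm{IDR} \quad \text{if } \bar{D}_1 = \bar{D}_0 .
    \]
    ```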

    DNA-Based Diet Analysis for Any Predator

    Background: Prey DNA from diet samples can be used as a dietary marker; yet current methods for prey detection require a priori diet knowledge and/or are designed ad hoc, limiting their scope. I present a general approach to detect diverse prey in the feces or gut contents of predators. Methodology/Principal Findings: In the example outlined, I take advantage of the restriction site for the endonuclease Pac I which is present in 16S mtDNA of most Odontoceti mammals, but absent from most other relevant non-mammalian chordates and invertebrates. Thus in DNA extracted from feces of these mammalian predators Pac I will cleave and exclude predator DNA from a small region targeted by novel universal primers, while most prey DNA remain intact allowing prey selective PCR. The method was optimized using scat samples from captive bottlenose dolphins (Tursiops truncatus) fed a diet of 6–10 prey species from three phlya. Up to five prey from two phyla were detected in a single scat and all but one minor prey item (2% of the overall diet) were detected across all samples. The same method was applied to scat samples from free-ranging bottlenose dolphins; up to seven prey taxa were detected in a single scat and 13 prey taxa from eight teleost families were identified in total. Conclusions/Significance: Data and further examples are provided to facilitate rapid transfer of this approach to any predator. This methodology should prove useful to zoologists using DNA-based diet techniques in a wide variety of study systems
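    A minimal sketch of the site-screening logic implied here, assuming the canonical Pac I recognition sequence TTAATTAA; the sequence strings and names below are made-up placeholders, not real 16S data:

    ```python
    # Sketch: flag candidate 16S amplicon sequences that carry the Pac I site and would
    # therefore be cleaved before PCR (predator templates), versus site-free templates
    # that stay intact and remain amplifiable (prey templates).
    PACI_SITE = "TTAATTAA"  # 8-bp palindromic recognition sequence of Pac I

    def would_be_cleaved(seq: str) -> bool:
        """True if the sequence contains the Pac I recognition site."""
        # The site is palindromic, so a single-strand check is sufficient.
        return PACI_SITE in seq.upper()

    amplicons = {
        "dolphin_16S_fragment": "ACGTTTAATTAAGGCATTAGC",  # placeholder, carries the site
        "herring_16S_fragment": "ACGTTTAACTAAGGCATTAGC",  # placeholder, no site
    }

    for name, seq in amplicons.items():
        status = "cleaved (excluded from PCR)" if would_be_cleaved(seq) else "intact (amplifiable)"
        print(f"{name}: {status}")
    ```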

    Automated machine learning for studying the trade-off between predictive accuracy and interpretability

    Automated Machine Learning (Auto-ML) methods search for the best classification algorithm and its best hyper-parameter settings for each input dataset. Auto-ML methods normally maximize only predictive accuracy, ignoring the classification model's interpretability, an important criterion in many applications. Hence, we propose a novel approach, based on Auto-ML, to investigate the trade-off between the predictive accuracy and the interpretability of classification-model representations. The experiments used the Auto-WEKA tool to investigate this trade-off. We distinguish between white box (interpretable) model representations and two other types of model representations: black box (non-interpretable) and grey box (partly interpretable). We consider as white box the models based on the following six interpretable knowledge representations: decision trees, If-Then classification rules, decision tables, Bayesian network classifiers, nearest neighbours and logistic regression. The experiments used 16 datasets and two runtime limits per Auto-WEKA run: 5 h and 20 h. Overall, the best white box model was more accurate than the best non-white box model in 4 of the 16 datasets in the 5-hour runs, and in 7 of the 16 datasets in the 20-hour runs. However, the predictive accuracy differences between the best white box and best non-white box models were often very small. If we accept a predictive accuracy loss of 1% in order to benefit from the interpretability of a white box model representation, we would prefer the best white box model in 8 of the 16 datasets in the 5-hour runs, and in 10 of the 16 datasets in the 20-hour runs.
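    A minimal sketch of the 1% tolerance rule described above; the per-dataset accuracy values are made-up placeholders, not the paper's results:

    ```python
    # Sketch: prefer the best white-box (interpretable) model whenever its accuracy is within
    # 1 percentage point of the best non-white-box model found under the same runtime limit.
    TOLERANCE = 0.01

    def prefer_white_box(acc_white: float, acc_other: float, tolerance: float = TOLERANCE) -> bool:
        """Prefer the interpretable model unless it loses more than `tolerance` accuracy."""
        return acc_white >= acc_other - tolerance

    results = [  # (dataset, best white-box accuracy, best non-white-box accuracy), placeholders
        ("dataset_a", 0.91, 0.92),
        ("dataset_b", 0.84, 0.88),
        ("dataset_c", 0.97, 0.96),
    ]

    preferred = [name for name, w, o in results if prefer_white_box(w, o)]
    print(f"white-box preferred on {len(preferred)} of {len(results)} datasets: {preferred}")
    ```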